Feature selection from heterogeneous biomedical data

Author

  • Jérôme Paul
Abstract

Modern personalised medicine uses high-dimensional genomic data to deliver customised diagnoses and prognoses. In addition, physicians record several medical parameters to evaluate clinical status. In this thesis, we are interested in jointly using these different but complementary kinds of variables to perform classification tasks. Our main goal is to make predictive models interpretable by reducing the number of variables they use, keeping only the most relevant ones. Selecting a few variables that predict a clinical outcome greatly helps medical doctors better understand the biological process under study. Mixing gene expression data and clinical variables is challenging because of their different natures: genomic measurements are expressed on a continuous scale, while clinical variables can be continuous or categorical. Although the biomedical domain is the original motivation for this work, we tackle the more general problem of feature selection in the presence of heterogeneous variables. Few variable selection methods handle both kinds of features directly, which is why we focus on tree ensemble methods and kernel approaches. Tree ensemble methods, such as random forests, successfully perform classification on data with heterogeneous variables. In addition, they provide a feature importance index that ranks variables according to their importance in the predictive model. Yet that index suffers from two main drawbacks. Firstly, the feature rankings it provides are highly sensitive to small variations in the dataset. Secondly, while the variables are accurately ranked, it is very difficult to decide which features actually play a role in the decision process. This work puts forward solutions to both problems. In an analysis of the stability of tree ensemble methods, we show that feature rankings become considerably more stable when more trees are grown than are needed to obtain good predictive performance.
We also introduce a statistically interpretable feature selection index. It assesses
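
The sensitivity of feature rankings to small changes in the data can be quantified in several ways. The abstract does not say which stability measure the thesis uses, so the sketch below is only an illustration under an assumed choice: the mean pairwise Jaccard similarity between the top-k feature sets obtained from several runs (for example, forests trained on different resamples). The function names and the importance values are hypothetical.

```python
from itertools import combinations


def top_k(importances, k):
    """Return the indices of the k largest importance scores as a set."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(importance_runs, k):
    """Mean pairwise Jaccard similarity of the top-k sets across runs.

    Assumed measure, not necessarily the one used in the thesis.
    A value of 1.0 means every run selected the same k features.
    """
    subsets = [top_k(run, k) for run in importance_runs]
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical importance vectors from three runs over four features.
runs = [
    [0.40, 0.30, 0.20, 0.10],
    [0.35, 0.10, 0.30, 0.25],
    [0.45, 0.30, 0.05, 0.20],
]

if __name__ == "__main__":
    print(selection_stability(runs, k=2))
```

Under this measure, one would compare the score obtained with a small ensemble against the score obtained with a much larger one; the abstract's claim corresponds to the score rising as more trees are grown.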

Similar resources

A New Hybrid Feature Subset Selection Algorithm for the Analysis of Ovarian Cancer Data Using Laser Mass Spectrum

Introduction: A major problem in the treatment of cancer is the lack of an appropriate method for the early diagnosis of the disease. The chemical reaction within an organ may be reflected in the form of proteomic patterns in the serum, sputum, or urine. Laser mass spectrometry is a valuable tool for extracting the proteomic patterns from biological samples. A major challenge in extracting such ...

Fast SFFS-Based Algorithm for Feature Selection in Biomedical Datasets

Biomedical datasets usually include a large number of features relative to the number of samples. However, some data dimensions may be less relevant or even irrelevant to the output class. Selection of an optimal subset of features is critical, not only to reduce the processing cost but also to improve the classification results. To this end, this paper presents a hybrid method of filter and wr...

Mental Arithmetic Task Recognition Using Effective Connectivity and Hierarchical Feature Selection From EEG Signals

Introduction: Mental arithmetic analysis based on electroencephalogram (EEG) signals for monitoring the state of the user's brain functioning can help in understanding psychological disorders such as attention deficit hyperactivity disorder, autism spectrum disorder, or dyscalculia, in which there is difficulty in learning or understanding arithmetic. Most mental arithmetic recogni...

Feature Selection in Structural Health Monitoring Big Data Using a Meta-Heuristic Optimization Algorithm

This paper focuses on the processing of structural health monitoring (SHM) big data. Extracted features of a structure are reduced using an optimization algorithm to find a minimal subset of salient features by removing noisy, irrelevant and redundant data. The PSO-Harmony algorithm is introduced for feature selection to enhance the capability of the proposed method for processing the measure...

Population-Based Feature Selection for Biomedical Data Classification

Classification of biomedical data plays a significant role in the prediction and diagnosis of disease. The existence of redundant and irrelevant features is one of the major problems in biomedical data classification. Excluding these features can improve the performance of the classification algorithm. Feature selection is the problem of selecting a subset of features without reducing the accuracy of t...

Publication date: 2015